KMID : 1022420210130040047
Phonetics and Speech Sciences, 2021, Vol. 13, No. 4, pp. 47-53
End-to-end non-autoregressive fast text-to-speech
Kim Wi-Back
Nam Ho-Sung
Abstract
Autoregressive Text-to-Speech (TTS) models suffer from inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t affects all subsequent predictions. Slow inference arises from a model structure that requires the samples predicted at time steps 1 to t-1 in order to predict the sample at time step t, so the samples must be generated one at a time.
This study proposes an end-to-end non-autoregressive fast text-to-speech model as a solution to these problems. The results show that the model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while its inference speed and stability are higher than those of Tacotron 2 - WaveNet. Further, this study aims to offer insight into the improvement of non-autoregressive models.
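The contrast between the two decoding styles described in the abstract can be illustrated with a toy sketch. It is not the model from the paper: the update rules, the function names (`autoregressive_decode`, `non_autoregressive_decode`, `steps_changed`) and the `bad_step` parameter are all hypothetical, chosen only to show how one poorly predicted sample propagates in an autoregressive loop but stays local when each step is computed independently.

```python
def autoregressive_decode(inputs, bad_step=None):
    """Toy autoregressive loop: each output depends on the previous output."""
    outputs = []
    prev = 0.0
    for t, x in enumerate(inputs):
        y = 0.5 * prev + x      # hypothetical update rule, not the paper's model
        if t == bad_step:
            y += 1.0            # simulate one poorly predicted sample
        outputs.append(y)
        prev = y                # the (possibly bad) sample feeds the next step
    return outputs


def non_autoregressive_decode(inputs, bad_step=None):
    """Toy non-autoregressive mapping: each output depends only on its own input."""
    # No step reads another step's output, so all steps could run in parallel.
    outputs = [2.0 * x for x in inputs]
    if bad_step is not None:
        outputs[bad_step] += 1.0  # the same simulated error, at one step only
    return outputs


def steps_changed(clean, corrupted):
    """Return the time steps at which the two output sequences differ."""
    return [t for t, (a, b) in enumerate(zip(clean, corrupted)) if a != b]


inputs = [1.0, 1.0, 1.0, 1.0, 1.0]
ar_affected = steps_changed(
    autoregressive_decode(inputs),
    autoregressive_decode(inputs, bad_step=1),
)
nar_affected = steps_changed(
    non_autoregressive_decode(inputs),
    non_autoregressive_decode(inputs, bad_step=1),
)
print(ar_affected)   # steps affected in the autoregressive toy
print(nar_affected)  # steps affected in the non-autoregressive toy
```

In this toy setting the error injected at step 1 changes every later output of the autoregressive loop, while the non-autoregressive mapping is affected only at step 1. The sequential `for` loop also shows why autoregressive inference cannot be parallelized across time steps.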
Keywords
deep learning, neural network, speech synthesis, Text-to-Speech (TTS)